Abstract



In this paper, we present algorithms that perform gradient ascent of the average reward in a par- 
tially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, 
an algorithm introduced in a companion paper (Baxter & Bartlett, 2001), which computes biased 
estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it 
uses only one free parameter $\beta \in [0, 1)$, which has a natural interpretation in terms of bias-variance
trade-off, it requires no knowledge of the underlying state, and it can be applied to infinite state, 
control and observation spaces. We show how the gradient estimates produced by GPOMDP can 
be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm, and with 
an algorithm based on conjugate-gradients that utilizes gradient information to bracket maxima in 
line searches. Experimental results are presented illustrating both the theoretical results of Baxter 
and Bartlett (2001) on a toy problem, and practical aspects of the algorithms on a number of more 
realistic problems. 

1. Introduction 

Function approximation is necessary to avoid the curse of dimensionality associated with large- 
scale dynamic programming and reinforcement learning problems. The dominant paradigm is to 
use the function to approximate the state (or state and action) values. Most algorithms then seek to 
minimize some form of error between the approximate value function and the true value function, 
usually by simulation (Sutton & Barto, 1998; Bertsekas & Tsitsiklis, 1996). While there have been 
a multitude of empirical successes for this approach (for example, Samuel, 1959; Tesauro, 1992, 
1994; Baxter, Tridgell, & Weaver, 2000; Zhang & Dietterich, 1995; Singh & Bertsekas, 1997), there 
are only weak theoretical guarantees on the performance of the policy generated by the approximate
value function. In particular, there is no guarantee that the policy will improve as the approximate
value function is trained; in fact performance can degrade even when the function class contains an 
approximate value function whose corresponding greedy policy is optimal (see Baxter & Bartlett, 
2001, Appendix A, for a simple two-state example). 

An alternative technique that has received increased attention recently is the "policy-gradient" 
approach in which the parameters of a stochastic policy are adjusted in the direction of the gradient 
of some performance criterion (typically either expected discounted reward or average reward). The 
key problem is how to compute the performance gradient under conditions of partial observability
when an explicit model of the system is not available. 

This question has been addressed in a large body of previous work (Barto, Sutton, & Anderson, 
1983; Williams, 1992; Glynn, 1986; Cao & Chen, 1997; Cao & Wan, 1998; Fu & Hu, 1994; 
Singh, Jaakkola, & Jordan, 1994, 1995; Marbach & Tsitsiklis, 1998; Marbach, 1998; Baird & 
Moore, 1999; Rubinstein & Melamed, 1998; Kimura, Yamamura, & Kobayashi, 1995; Kimura, 
Miyazaki, & Kobayashi, 1997). See the introduction of (Baxter & Bartlett, 2001) for a discussion 
of the history of policy-gradient approaches. Most existing algorithms rely on the existence of an 
identifiable recurrent state in order to make their updates to the gradient estimate, and the variance 
of the algorithms is governed by the recurrence time to that state. In cases where the recurrence time 
is too large (for instance because the state space is large), or in situations of partial observability
where such a state cannot be reliably identified, we need to seek alternatives that do not require 
access to such a state. 

Motivated by these considerations, Baxter and Bartlett (2001, 2000) introduced and analysed 
GPOMDP — an algorithm for generating a biased estimate of the gradient of the average reward in 
general Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized 
stochastic policies. The chief advantages of GPOMDP are that it requires only a single sample path 
of the underlying Markov chain, it uses only one free parameter $\beta \in [0, 1)$, which has a natural
interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying 
state. 

More specifically, suppose $\theta \in \mathbb{R}^K$ are the parameters controlling the POMDP. For example, $\theta$
could be the parameters of an approximate neural-network value-function that generates a stochastic
policy by some form of randomized look-ahead, or $\theta$ could be the parameters of an approximate Q
function used to stochastically select controls.¹ Let $\eta(\theta)$ denote the average reward of the POMDP
with parameter setting $\theta$. GPOMDP computes an approximation $\nabla_\beta \eta(\theta)$ to $\nabla \eta(\theta)$ based on a single
continuous sample path of the underlying Markov chain. The accuracy of the approximation is
controlled by the parameter $\beta \in [0, 1)$, and one can show that

$$\nabla \eta(\theta) = \lim_{\beta \to 1} \nabla_\beta \eta(\theta).$$

The trade-off preventing choosing $\beta$ arbitrarily close to 1 is that the variance of GPOMDP's
estimates of $\nabla_\beta \eta(\theta)$ scales as $1/(1 - \beta)^2$. However, on the bright side, it can also be shown that the bias
of $\nabla_\beta \eta(\theta)$ (measured by $\|\nabla_\beta \eta(\theta) - \nabla \eta(\theta)\|$) is proportional to $\tau(1 - \beta)$ where $\tau$ is a suitable mixing
time of the Markov chain underlying the POMDP (Bartlett & Baxter, 2000a). Thus for "rapidly
mixing" POMDPs (for which $\tau$ is small), estimates of the performance gradient with acceptable
bias and variance can be obtained.

Provided $\nabla_\beta \eta(\theta)$ is a sufficiently accurate approximation to $\nabla \eta(\theta)$ (in fact, $\nabla_\beta \eta(\theta)$ need only
be within 90° of $\nabla \eta(\theta)$), small adjustments to the parameters $\theta$ in the direction $\nabla_\beta \eta(\theta)$ will
guarantee improvement in the average reward $\eta(\theta)$. In this case, gradient-based optimization algorithms
using $\nabla_\beta \eta(\theta)$ as their gradient estimate will be guaranteed to improve the average reward $\eta(\theta)$ on
each step. Except in the case of table-lookup, most value-function based approaches to
reinforcement learning cannot make this guarantee.

In this paper we present a conjugate-gradient ascent algorithm that uses the estimates of $\nabla_\beta \eta(\theta)$
provided by GPOMDP. Critical to the successful operation of the algorithm is a novel line search 

1. Stochastic policies are not strictly necessary in our framework, but the policy must be "differentiable" in the sense
that $\nabla \eta(\theta)$ exists.
subroutine that brackets maxima by relying solely upon gradient estimates. This largely avoids
problems associated with finding the maximum using noisy value estimates. Since the parameters 
are only updated after accumulating sufficiently accurate estimates of the gradient direction, we refer 
to this approach as the "off-line" algorithm. This approach essentially allows us to take a stochastic 
gradient optimization problem and treat it as a non-stochastic optimization problem, thus enabling 
the use of a large body of accumulated heuristics and algorithmic improvements associated with 
such methods. We also present a more traditional, "on-line" stochastic gradient ascent algorithm 
based on GPOMDP that updates the parameters at every time step. This algorithm is essentially the 
algorithm proposed in (Kimura et al., 1997). 

The off-line and on-line algorithms are applied to a variety of problems, beginning with a simple 
3-state Markov decision process (MDP) controlled by a linear function for which the true gradient 
can be exactly computed. We show rapid convergence of the gradient estimates $\nabla_\beta \eta(\theta)$ to the true
gradient, in this case over a large range of values of $\beta$. With this simple system we are able to
illustrate vividly the bias/variance tradeoff associated with the selection of $\beta$. We then compare the
performance of the off-line and on-line approaches applied to finding a good policy for the MDP. 
The off-line algorithm reliably finds a near-optimal policy in less than 100 iterations of the Markov 
chain, an order of magnitude faster than the on-line approach. This can be attributed to the more 
aggressive exploitation of the gradient information by the off-line method. 

Next we demonstrate the effectiveness of the off-line algorithm in training a neural network 
controller to control a "puck" in a two-dimensional world. The task in this case is to reliably 
navigate the puck from any starting configuration to an arbitrary target location in the minimum 
time, while only applying discrete forces in the x and y directions. Although the on-line algorithm 
was tried for this problem, convergence was considerably slower and we were not able to reliably 
find a good local optimum. 

In the third experiment, we use the off-line algorithm to train a controller for the call admission 
queueing problem treated in (Marbach, 1998). In this case near-optimal solutions are found within 
about 2000 iterations of the underlying queue, 1-2 orders of magnitude faster than the experiments 
reported in (Marbach, 1998) with on-line (stochastic-gradient) algorithms. 

In the fourth and final experiment, the off-line algorithm was used to reliably train a switched 
neural-network controller for a two-dimensional variation on the classical "mountain-car" task (Sut- 
ton & Barto, 1998, Example 8.2). 

The rest of this paper is organized as follows. In Section 2 we introduce POMDPs controlled by 
stochastic policies, and the assumptions needed for our algorithms to apply. GPOMDP is described 
in Section 3. In Section 4 we describe the off-line and on-line gradient-ascent algorithms, including 
the gradient-based line-search subroutine. Experimental results are presented in Section 5. 

2. POMDPs Controlled by Stochastic Policies 

A partially observable Markov decision process (POMDP) consists of a state space $S$, observation
space $\mathcal{Y}$ and a control space $U$. For each state $i \in S$ there is a deterministic reward $r(i)$. Although
the results in Baxter and Bartlett (2001) only guarantee convergence of GPOMDP in the case of
finite $S$ (but continuous $U$ and $\mathcal{Y}$), the algorithm can be applied regardless of the nature of $S$ so we
do not restrict the cardinality of $S$, $U$ or $\mathcal{Y}$.

Consider first the case of discrete $S$, $U$ and $\mathcal{Y}$. Each control $u \in U$ determines a stochastic
matrix $P(u) = [p_{ij}(u)]$ giving the transition probability from state $i$ to state $j$ ($i, j \in S$). For each
state $i \in S$, an observation $Y \in \mathcal{Y}$ is generated independently according to a probability distribution
$\nu(i)$ over observations in $\mathcal{Y}$. We denote the probability that $Y = y$ by $\nu_y(i)$. A randomized policy
is simply a function $\mu$ mapping observations into probability distributions over the controls $U$. That
is, for each observation $y \in \mathcal{Y}$, $\mu(y)$ is a distribution over the controls in $U$. Denote the probability
under $\mu$ of control $u$ given observation $y$ by $\mu_u(y)$.

For continuous $S$, $\mathcal{Y}$ and $U$, $p_{ij}(u)$ becomes a kernel $k_{ij}(u)$ giving the probability density of
transitions from $i$ to $j$, $\nu(i)$ becomes a probability density function on $\mathcal{Y}$ with $\nu_y(i)$ the density at $y$,
and $\mu(y)$ becomes a probability density function on $U$ with $\mu_u(y)$ the density at $u$.

To each randomized policy there corresponds a Markov chain in which state transitions are
generated by first selecting an observation $Y$ in state $i$ according to the distribution $\nu(i)$, then
selecting a control $U$ according to the distribution $\mu(Y)$, and finally generating a transition to state $j$
according to the probability $p_{ij}(U)$.
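The three sampling stages just described can be sketched in a few lines of Python. This is only an illustration for the discrete case: the table names `nu`, `mu` and `P`, the helper `sample`, and all numerical values are hypothetical and do not come from the paper.

```python
import random

def sample(dist, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def pomdp_step(i, nu, mu, P, rng):
    """One transition of the Markov chain induced by a randomized policy.

    nu[i][y]   : probability of observation y in state i
    mu[y][u]   : probability of control u given observation y
    P[u][i][j] : probability of moving from state i to state j under control u
    """
    y = sample(nu[i], rng)     # observe Y according to nu(i)
    u = sample(mu[y], rng)     # select control U according to mu(Y)
    j = sample(P[u][i], rng)   # move to state j with probability p_ij(U)
    return y, u, j

# Hypothetical two-state, two-observation, two-control example.
nu = [[0.9, 0.1], [0.2, 0.8]]
mu = [[0.7, 0.3], [0.4, 0.6]]
P = [[[0.5, 0.5], [0.1, 0.9]],
     [[0.8, 0.2], [0.3, 0.7]]]

rng = random.Random(0)
y, u, j = pomdp_step(0, nu, mu, P, rng)
```

Note that the controller only ever sees `y`; the state `i` is used by the simulator alone, which is what makes the process partially observable.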

At present we are only dealing with a fixed POMDP. To parameterize the POMDP we
parameterize the policies, so that $\mu$ now becomes a function $\mu(\theta, y)$ of a set of parameters $\theta \in \mathbb{R}^K$,
as well as of the observation $y$. The Markov chain corresponding to $\theta$ has state transition matrix
$P(\theta) = [p_{ij}(\theta)]$ given by

$$p_{ij}(\theta) = \mathbb{E}_{Y \sim \nu(i)} \, \mathbb{E}_{U \sim \mu(\theta, Y)} \, p_{ij}(U). \qquad (1)$$

Note that the policies $\mu$ are purely reactive or memoryless in that their choice of action is based only
upon the current observation. All the experiments described in the present paper use purely reactive
policies. Aberdeen and Baxter (2001) have extended GPOMDP and the techniques of the present
paper to controllers with internal state.
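For discrete spaces, the double expectation in equation (1) is a finite sum over observations and controls. The sketch below evaluates it for a fixed policy table; the function name `induced_transition_matrix` and the numbers are illustrative assumptions, and a parameterized policy would simply rebuild `mu` from $\theta$ before the call.

```python
def induced_transition_matrix(nu, mu, P):
    """Evaluate p_ij = sum_y nu_y(i) * sum_u mu_u(y) * p_ij(u) for all i, j.

    nu[i][y]   : observation probabilities in state i
    mu[y][u]   : control probabilities given observation y
    P[u][i][j] : transition probabilities under control u
    """
    n_states = len(nu)
    n_obs = len(mu)
    n_controls = len(P)
    return [[sum(nu[i][y] * mu[y][u] * P[u][i][j]
                 for y in range(n_obs)
                 for u in range(n_controls))
             for j in range(n_states)]
            for i in range(n_states)]

# Hypothetical two-state example.
nu = [[0.9, 0.1], [0.2, 0.8]]
mu = [[0.7, 0.3], [0.4, 0.6]]
P = [[[0.5, 0.5], [0.1, 0.9]],
     [[0.8, 0.2], [0.3, 0.7]]]

P_theta = induced_transition_matrix(nu, mu, P)
```

Each row of `P_theta` is a probability distribution, so the result is again a stochastic matrix.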

The following technical assumptions are required for the operation of GPOMDP. 

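Assumption 1. The derivatives,

$$\frac{\partial \mu_u(\theta, y)}{\partial \theta_k}$$

exist, and the ratios

$$\frac{\left| \partial \mu_u(\theta, y) / \partial \theta_k \right|}{\mu_u(\theta, y)}$$

are uniformly bounded by $B < \infty$, for all $u \in U$, $y \in \mathcal{Y}$, $\theta \in \mathbb{R}^K$ and $k = 1, \ldots, K$.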

The second part of this assumption is needed because the ratio appears in the GPOMDP
algorithm. It allows zero-probability actions $\mu_u(\theta, y) = 0$ only if $\nabla \mu_u(\theta, y)$ is also zero, in which case
we set $0/0 = 0$. See Section 5 for examples of policies satisfying this requirement.

Assumption 2. The magnitudes of the rewards, $|r(i)|$, are uniformly bounded by $R < \infty$ for all
states $i$.

For deterministic rewards, this condition only represents a restriction in infinite state spaces.
However, all the results in the present paper apply to bounded stochastic rewards, in which case $r(i)$
is the expectation of the reward in state $i$.

Assumption 3. Each $P(\theta)$, $\theta \in \mathbb{R}^K$, has a unique stationary distribution $\pi(\theta) = [\pi_1(\theta), \ldots, \pi_n(\theta)]$,
satisfying the balance equations:

$$\pi(\theta) P(\theta) = \pi(\theta).$$
Assumption 3 ensures that, for all parameters $\theta$, the Markov chain forms a single recurrent class.
Since any finite-state Markov chain always ends up in a recurrent class, and it is the properties of
this class that determine the long-term average reward, this assumption is mainly for convenience
so that we do not have to include the recurrence class as a quantifier in our theorems. Observe
that episodic problems, such as the minimization of time to a goal state, may be modeled in a way 
that satisfies Assumption 3 by simply resetting the agent upon reaching the goal state back to some 
initial starting distribution over states. Examples are described in Section 5. 

The average reward $\eta(\theta)$ is simply the expected reward under the stationary distribution $\pi(\theta)$:

$$\eta(\theta) = \sum_{i=1}^{n} \pi_i(\theta) \, r(i). \qquad (2)$$

Because of Assumption 3, $\eta(\theta)$ is also equal to the expected long-term average of the reward
received when starting from any state $i$:

$$\eta(\theta) = \lim_{T \to \infty} \mathbb{E}\left[ \frac{1}{T} \sum_{t=0}^{T-1} r(X_t) \,\middle|\, X_0 = i \right].$$

Here the expectation is over sequences of states $X_0, \ldots, X_{T-1}$ with state transitions generated by
$P(\theta)$ (note that the expectation is independent of the starting state $i$).
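As a small numerical illustration of equation (2), the sketch below finds the stationary distribution of a hypothetical two-state chain by repeatedly applying the balance equations, then forms the average reward. It assumes the chain is aperiodic as well as satisfying Assumption 3, so that the iteration converges; the matrix, rewards and function names are made up for the example.

```python
def stationary_distribution(P, max_iters=10000, tol=1e-12):
    """Iterate pi <- pi P from the uniform distribution until it stops changing."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iters):
        new_pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    return pi

def average_reward(P, r):
    """Average reward: sum_i pi_i * r(i), as in equation (2)."""
    pi = stationary_distribution(P)
    return sum(p * ri for p, ri in zip(pi, r))

# Hypothetical chain: solving pi P = pi by hand gives pi = [5/6, 1/6].
P = [[0.9, 0.1],
     [0.5, 0.5]]
r = [0.0, 1.0]

eta = average_reward(P, r)
```

Here the only rewarded state is visited one sixth of the time in the long run, so `eta` is approximately 1/6 whichever state the chain starts in.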



3. The GPOMDP Algorithm 

GPOMDP (Algorithm 1) is an algorithm for computing a biased estimate $\Delta_T$ of the gradient of the
average reward $\nabla \eta(\theta)$. $\Delta_T$ satisfies

$$\lim_{T \to \infty} \Delta_T = \nabla_\beta \eta(\theta),$$

where $\nabla_\beta \eta(\theta)$ ($\beta \in [0, 1)$) is an approximation to $\nabla \eta(\theta)$ satisfying

$$\nabla \eta(\theta) = \lim_{\beta \to 1} \nabla_\beta \eta(\theta)$$

(Baxter & Bartlett, 2001, Theorems 2, 5). Note that GPOMDP relies only upon a single sample path
from the POMDP. Also, it does not require knowledge of the transition probability matrix $P$, nor of
the observation process $\nu$; it only requires knowledge of the randomized policy $\mu$, in particular the
ability to compute the gradient of the probability of the chosen control divided by the probability of
the chosen control.

We cannot set $\beta$ arbitrarily close to 1 in GPOMDP, since the variance of the estimate is
proportional to $1/(1 - \beta)^2$. However, on the bright side, it can also be shown that the bias of $\nabla_\beta \eta(\theta)$
(measured by $\|\nabla_\beta \eta(\theta) - \nabla \eta(\theta)\|$) is proportional to $\tau(1 - \beta)$ where $\tau$ is a suitable mixing time of the
Markov chain underlying the POMDP (Bartlett & Baxter, 2000a). Under Assumption 3, regardless
of the initial starting state, the distribution over states converges to the stationary distribution $\pi(\theta)$
when the agent is following policy $\mu(\theta, \cdot)$. Standard Markov chain theory shows that the rate of
convergence to $\pi(\theta)$ is exponential, and loosely speaking, the mixing time $\tau$ is the time constant in
the exponential decay.
Algorithm 1 GPOMDP($\beta$, $T$, $\theta$)

 1: Given:
      • $\beta \in [0, 1)$.
      • $T > 0$.
      • Parameters $\theta \in \mathbb{R}^K$.
      • Randomized policy $\mu(\theta, \cdot)$ satisfying Assumption 1.
      • POMDP with rewards satisfying Assumption 2, and which when controlled by $\mu(\theta, \cdot)$
        generates stochastic matrices $P(\theta)$ satisfying Assumption 3.
      • Arbitrary (unknown) starting state $X_0$.
 2: Set $z_0 = 0$ and $\Delta_0 = 0$ ($z_0, \Delta_0 \in \mathbb{R}^K$).
 3: for $t = 0$ to $T - 1$ do
 4:    Observe $Y_t$ (generated according to the observation distribution $\nu(X_t)$).
 5:    Generate control $U_t$ according to $\mu(\theta, Y_t)$.
 6:    Observe $r(X_{t+1})$ (where the next state $X_{t+1}$ is generated according to $p_{X_t X_{t+1}}(U_t)$).
 7:    Set $z_{t+1} = \beta z_t + \dfrac{\nabla \mu_{U_t}(\theta, Y_t)}{\mu_{U_t}(\theta, Y_t)}$
 8:    Set $\Delta_{t+1} = \Delta_t + r(X_{t+1}) z_{t+1}$
 9: end for
10: $\Delta_T \leftarrow \Delta_T / T$
11: return $\Delta_T$



Thus $\beta$ has a natural interpretation in terms of a bias/variance trade-off: small values of $\beta$
give lower variance in the estimates $\Delta_T$, but higher bias in that the expectation of $\Delta_T$ may be far
from $\nabla \eta(\theta)$, whereas values of $\beta$ close to 1 yield small bias but correspondingly larger variance.
Fortunately, for problems which mix rapidly (small $\tau$), $\beta$ can be small and still yield reasonable
bias. This bias/variance trade-off is vividly illustrated in the experiments of Section 5; see (Bartlett
& Baxter, 2000a) for a more detailed theoretical discussion of the bias/variance question.
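The loop of Algorithm 1 translates almost line for line into code. The following is a minimal sketch rather than the authors' implementation: the callback interface (`observe`, `policy`, `transition`, `reward`) and the two-state toy problem are assumptions introduced purely for illustration. In the toy problem the observation equals the state, control $u$ moves the chain to state $u$ with probability 0.8, and only state 1 is rewarded.

```python
import math
import random

def gpomdp(beta, T, theta, observe, policy, transition, reward, x0, rng):
    """Estimate the average-reward gradient from a single sample path of length T."""
    K = len(theta)
    z = [0.0] * K        # eligibility trace z_t
    delta = [0.0] * K    # running sum Delta_t
    x = x0
    for _ in range(T):
        y = observe(x, rng)
        u, ratio = policy(theta, y, rng)   # ratio = grad mu_u(theta, y) / mu_u(theta, y)
        x = transition(x, u, rng)          # next state X_{t+1}
        z = [beta * zk + gk for zk, gk in zip(z, ratio)]
        delta = [dk + reward(x) * zk for dk, zk in zip(delta, z)]
    return [dk / T for dk in delta]

# --- Hypothetical toy problem (not from the paper) ---
def observe(x, rng):
    return x  # fully observed: the observation is the state itself

def policy(theta, y, rng):
    # Probability of control 1 is a logistic function of theta[y].
    p1 = 1.0 / (1.0 + math.exp(-theta[y]))
    u = 1 if rng.random() < p1 else 0
    ratio = [0.0] * len(theta)
    ratio[y] = (1.0 - p1) if u == 1 else -p1
    return u, ratio

def transition(x, u, rng):
    return u if rng.random() < 0.8 else 1 - u

def reward(x):
    return float(x)

estimate = gpomdp(0.5, 20000, [0.0, 0.0],
                  observe, policy, transition, reward, 0, random.Random(1))
```

For this toy chain the exact gradient at $\theta = (0, 0)$ works out by hand to $(0.075, 0.075)$, so both components of `estimate` should land near that value. Rerunning with larger `beta` makes the scatter across seeds visibly larger, in line with the trade-off described above.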



4. Stochastic Gradient Ascent Algorithms 

This section introduces two approaches to exploiting the gradient estimates produced by GPOMDP: 



1. an off-line approach based on traditional conjugate-gradient optimization techniques but em- 
ploying a novel line-search mechanism to cope with the noise in GPOMDP's estimates, and 



2. an on-line stochastic optimization approach that uses the core update in GPOMDP ($r(X_t) z_t$)
to update the parameters $\theta$ on every iteration of the POMDP.
4.1 Off-line optimization of the average reward

GPOMDP generates biased and noisy estimates $\Delta_T$ of the gradient of the average reward $\nabla \eta(\theta)$ for
POMDPs controlled by parameterized stochastic policies. A straightforward algorithm for finding
local maxima of $\eta(\theta)$ would be to compute $\Delta_T(\theta)$ at the current parameter settings $\theta$, and then
modify $\theta$ by $\theta \leftarrow \theta + \gamma \Delta_T(\theta)$. Provided $\Delta_T(\theta)$ is close enough to the true gradient direction $\nabla \eta(\theta)$,
and provided the step-sizes $\gamma$ are suitably decreasing, standard stochastic optimization theory tells us
that this technique will converge to a local maximum of $\eta(\theta)$. However, given that each computation
of $\Delta_T(\theta)$ requires many iterations of the POMDP to guarantee suitably accurate gradient estimates
(that is, in general $T$ needs to be large), we would like to more aggressively exploit the information
contained in $\Delta_T(\theta)$ than by simply adjusting the parameters $\theta$ by a small amount in the direction
$\Delta_T(\theta)$.

There are two techniques for making better use of gradient information that are widely used in 
non-stochastic optimization: better choice of the search direction and better choice of step size. Bet- 
ter search directions can be found by employing conjugate-gradient directions rather than the pure 
gradient direction. Better step sizes are usually obtained by performing some kind of line-search to 
find a local maximum in the search direction, or through the use of second order methods. Since 
line-search techniques tend to be more robust to departures from quadraticity in the optimization
surface, we will only consider those here (however, see Baxter & Bartlett, 2001, Section 7.3, for a
discussion of how second-order derivatives may be computed with a GPOMDP-like algorithm).

CONJPOMDP, described in Algorithm 2, is a version of the Polak-Ribiere conjugate-gradient
algorithm (see, e.g. Fine, 1999, Section 5.5.2) that is designed to operate using only noisy (and
possibly biased) estimates of the gradient of the objective function (for example, the estimates
provided by GPOMDP). The argument GRAD to CONJPOMDP computes the gradient estimate.
The novel feature of CONJPOMDP is GSEARCH, a line-search subroutine that uses only
gradient information to find the local maximum in the search direction. The use of gradient
information ensures GSEARCH is robust to noise in the performance estimates. Both CONJPOMDP and
GSEARCH can be applied to any stochastic optimization problem for which noisy (and possibly
biased) gradient estimates are available.

The argument $s_0$ to CONJPOMDP provides an initial step-size for GSEARCH. The argument $\epsilon$
provides a stopping condition; when $\|\mathrm{GRAD}(\theta)\|^2$ falls below $\epsilon$, CONJPOMDP terminates.

4.2 The GSEARCH algorithm 

The key to the successful operation of CONJPOMDP is the line-search algorithm GSEARCH
(Algorithm 3). GSEARCH uses only gradient information to bracket the maximum in the direction $\theta^*$,
and then quadratic interpolation to jump to the maximum.

Algorithm 2 CONJPOMDP(GRAD, $\theta$, $s_0$, $\epsilon$)

 1: Given:
      • GRAD: $\mathbb{R}^K \to \mathbb{R}^K$: a (possibly noisy and biased) estimate of the gradient of the
        objective function to be maximized.
      • Starting parameters $\theta \in \mathbb{R}^K$ (set to maximum on return).
      • Initial step size $s_0 > 0$.
      • Gradient resolution $\epsilon$.
 2: $g = h = \mathrm{GRAD}(\theta)$
 3: while $\|g\|^2 > \epsilon$ do
 4:    GSEARCH(GRAD, $\theta$, $h$, $s_0$, $\epsilon$)
 5:    $\Delta = \mathrm{GRAD}(\theta)$
 6:    $\gamma = (\Delta - g) \cdot \Delta / \|g\|^2$
 7:    $h = \Delta + \gamma h$
 8:    if $h \cdot \Delta < 0$ then
 9:       $h = \Delta$
10:    end if
11:    $g = \Delta$
12: end while

We found the use of gradients to bracket the maximum far more robust than the use of function
values. To illustrate why this is so, in Figure 1 we have plotted a stylized view of the average reward
$\eta(\theta)$ along some search direction $\theta^*$ (labeled "f" in the figure), and its gradient in that direction
$\nabla \eta(\theta) \cdot \theta^*$ (labeled "grad(f)"). There are two ways we could search in the direction $\theta^*$ to bracket
the maximum of $\eta(\theta)$ in that direction (at 0 in this case), one using function values and the other
using gradient estimates:

1. Find three points $\theta_1, \theta_2, \theta_3$, all lying in the direction $\theta^*$ from $\theta$, such that $\eta(\theta_1) < \eta(\theta_2)$ and
$\eta(\theta_3) < \eta(\theta_2)$. Assuming no overshooting, we then know the maximum must lie between $\theta_1$
and $\theta_3$ and we can use the three points and quadratic interpolation to estimate the location of
the maximum.

2. Find two points $\theta_1$ and $\theta_2$ such that $\nabla \eta(\theta_1) \cdot \theta^* > 0$ and $\nabla \eta(\theta_2) \cdot \theta^* < 0$, and again use
quadratic interpolation (which corresponds to linear interpolation of the gradients) to estimate
the location of the maximum.

Both of these approaches will be equally satisfactory provided there is no noise in either the function
estimates $\eta(\theta)$, or the gradient estimates $\nabla \eta(\theta)$. However, when estimates of $\eta(\theta)$ or $\nabla \eta(\theta)$ are
available only through simulation, they will necessarily be noisy and the situation will look more
like Figure 2. In this case the use of gradients to bracket the maximum becomes more desirable,
because the line-search technique based on value estimates could choose any of the peaks in the
plot of f + noise as the location of the maximum, which occur nearly uniformly along the x-axis,
whereas the second technique based on gradients would choose any of the zero-crossings of the
noisy gradient plot, which are far closer to the true maximum.² This is illustrated in Figure 3.

2. There is an implicit assumption in our argument that the noise processes in the gradient and value estimates are of
approximately the same magnitude. If the variance of the value estimates is considerably smaller than the variance of
the gradient estimates then we would expect bracketing with values to be superior. In all our experiments we found
gradient bracketing to be superior.

Figure 1: Stylized plot of the average reward $\eta(\theta)$ and the gradient $\nabla \eta(\theta) \cdot \theta^*$ in a search direction $\theta^*$.

Figure 2: Plot as in Figure 1 but with estimation noise added to both the function and gradient
curves.

Another view of this phenomenon is that regardless of the variance of our estimates of $\eta(\theta)$, the
variance of $\mathrm{sign}[\eta(\theta_1) - \eta(\theta_2)]$ approaches 1 (the maximum possible) as $\theta_1$ approaches $\theta_2$. Thus,
to reliably bracket the maximum using noisy estimates of $\eta(\theta)$ we need to be able to reduce the
variance of the estimates when $\theta_1$ and $\theta_2$ are close. In our case this means running the simulation
from which the estimates are derived for longer and longer periods of time. In contrast, the variance
of $\mathrm{sign}[\nabla \eta(\theta_1) \cdot \theta^*]$ (and $\mathrm{sign}[\nabla \eta(\theta_2) \cdot \theta^*]$) is independent of the distance between $\theta_1$ and $\theta_2$, and in
particular does not grow as the two points approach one another.
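The claim about the sign of a noisy difference can be checked with a short calculation. The sketch below assumes, purely for illustration, that the two value estimates carry independent Gaussian noise with a common standard deviation `sigma`; the paper does not specify a noise model, and the function name is hypothetical.

```python
import math

def sign_variance(gap, sigma):
    """Variance of sign(A - B) when A - B is Normal with mean `gap` and variance 2*sigma^2."""
    # P(A - B > 0) is the standard normal CDF at gap / (sigma * sqrt(2)),
    # which equals 0.5 * (1 + erf(gap / (2 * sigma))).
    p_positive = 0.5 * (1.0 + math.erf(gap / (2.0 * sigma)))
    mean_sign = 2.0 * p_positive - 1.0
    # sign takes values +1 or -1, so its second moment is 1.
    return 1.0 - mean_sign ** 2

variance_far = sign_variance(4.0, 1.0)    # true values well separated
variance_near = sign_variance(0.1, 1.0)   # true values almost equal
variance_equal = sign_variance(0.0, 1.0)  # identical true values
```

Under this noise model the variance is close to zero when the true values differ by several noise standard deviations and climbs to exactly 1 as the gap shrinks to zero, which is the behaviour the paragraph above describes for value-based bracketing.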

One disadvantage to using gradient estimates to bracket is that it is not possible to detect extreme
overshooting of the maximum. However, this can be avoided by using value estimates as a "sanity
check" to determine if the value has dropped dramatically, and suitably adjusting the search if this
occurs.

Figure 3: Plot of the possible maximum locations that would be found by a line-search algorithm
based on value estimates (f), and one based on gradient estimates (grad(f)), for the curves
in Figure 2. The zero-crossings in each case are the possible locations. Note that the
gradient-based approach more accurately localizes the maximum.

In Algorithm 3, lines 5-25 bracket the maximum by finding a parameter setting $\theta_- = \theta_0 + s_- \theta^*$
such that $\mathrm{GRAD}(\theta_-) \cdot \theta^* > -\epsilon$, and a second parameter setting $\theta_+ = \theta_0 + s_+ \theta^*$ such that
$\mathrm{GRAD}(\theta_+) \cdot \theta^* < \epsilon$. The reason for $\epsilon$ rather than 0 in these expressions is to provide some robustness
against errors in the estimates $\mathrm{GRAD}(\theta)$. It also prevents the algorithm "stepping to $\infty$" if there is
no local maximum in the direction $\theta^*$. Note that we use the same $\epsilon$ as used in CONJPOMDP to
determine when to terminate due to small gradient (line 3 in CONJPOMDP).

Provided that the signs of the gradients at the bracketing points $\theta_-$ and $\theta_+$ show that the
maximum of the quadratic defined by these points lies between them, line 27 will jump to the maximum.
Otherwise the algorithm simply jumps to the midpoint between $\theta_-$ and $\theta_+$.
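The bracketing and interpolation steps, together with the Polak-Ribiere outer loop that calls them, can be sketched as follows. This is a hedged illustration, not the authors' code: the iteration caps `max_steps` and `max_iters` are safeguards added here, and the concave quadratic test objective with its exact, noise-free gradient is hypothetical. In practice GRAD would be a noisy estimator such as GPOMDP.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gsearch(grad, theta0, direction, s0, eps, max_steps=50):
    """Bracket the maximum along `direction` using gradient signs, then interpolate."""
    def slope(s):
        theta = [t + s * d for t, d in zip(theta0, direction)]
        return dot(grad(theta), direction)

    s = s0
    p = slope(s)
    if p < 0:
        # Overshot: step back (halve s) until the slope is no longer clearly negative.
        for _ in range(max_steps):
            s_plus, p_plus = s, p
            s = s / 2.0
            p = slope(s)
            if p > -eps:
                break
        s_minus, p_minus = s, p
    else:
        # Still climbing: step forward (double s) until the slope is no longer clearly positive.
        for _ in range(max_steps):
            s_minus, p_minus = s, p
            s = 2.0 * s
            p = slope(s)
            if p < eps:
                break
        s_plus, p_plus = s, p
    if p_minus > 0 and p_plus < 0:
        # Linear interpolation of the slopes gives the zero crossing.
        s = s_minus - p_minus * (s_plus - s_minus) / (p_plus - p_minus)
    else:
        s = (s_minus + s_plus) / 2.0
    return [t + s * d for t, d in zip(theta0, direction)]

def conjpomdp(grad, theta, s0, eps, max_iters=100):
    """Polak-Ribiere conjugate-gradient ascent driven only by gradient estimates."""
    g = h = grad(theta)
    for _ in range(max_iters):
        if dot(g, g) <= eps:
            break
        theta = gsearch(grad, theta, h, s0, eps)
        delta = grad(theta)
        gamma = dot([d - gi for d, gi in zip(delta, g)], delta) / dot(g, g)
        h = [d + gamma * hi for d, hi in zip(delta, h)]
        if dot(h, delta) < 0:
            h = delta   # fall back to the plain gradient direction
        g = delta
    return theta

# Hypothetical objective f(theta) = -(theta_0 - 1)^2 - 2 (theta_1 + 0.5)^2, maximized at (1, -0.5).
def exact_grad(theta):
    return [-2.0 * (theta[0] - 1.0), -4.0 * (theta[1] + 0.5)]

theta_star = conjpomdp(exact_grad, [0.0, 0.0], 1.0, 1e-6)
```

On this two-dimensional quadratic the slope along each search direction is linear, so the interpolation step is exact and two line searches reach the maximum. With noisy gradients the same code applies unchanged, but the bracketing points and the interpolated step become random quantities.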

4.3 On-line optimization of the average reward: OLPOMDP 

CONJPOMDP combined with GSEARCH operates by iteratively choosing "uphill" directions and 
then searching for a local maximum in the chosen direction. If the GRAD argument to CONJPOMDP 
is GPOMDP, the optimization will involve many iterations of the underlying POMDP between pa- 
rameter updates. 

In traditional stochastic optimization one typically uses algorithms that update the parameters
at every iteration, rather than accumulating gradient estimates over many iterations. Algorithm 4,
OLPOMDP, presents an adaptation of GPOMDP to this form. See Bartlett and Baxter (2000b) for a
proof that OLPOMDP converges to the vicinity of a local maximum of $\eta(\theta)$. Note that OLPOMDP
is very similar to the algorithms proposed in Kimura et al. (1995, 1997).






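Algorithm 3 GSEARCH(GRAD, $\theta_0$, $\theta^*$, $s_0$, $\epsilon$)

 1: Given:
      • GRAD: $\mathbb{R}^K \to \mathbb{R}^K$: a (possibly noisy and biased) estimate of the gradient of the
        objective function.
      • Starting parameters $\theta_0 \in \mathbb{R}^K$ (set to maximum on return).
      • Search direction $\theta^* \in \mathbb{R}^K$ with $\mathrm{GRAD}(\theta_0) \cdot \theta^* > 0$.
      • Initial step size $s_0 > 0$.
      • Inner product resolution $\epsilon \geq 0$.
 2: $s = s_0$
 3: $\theta = \theta_0 + s\theta^*$
 4: $\Delta = \mathrm{GRAD}(\theta)$
 5: if $\Delta \cdot \theta^* < 0$ then
       Step back to bracket the maximum:
 6:    repeat
 7:       $s_+ = s$
 8:       $p_+ = \Delta \cdot \theta^*$
 9:       $s = s/2$
10:       $\theta = \theta_0 + s\theta^*$
11:       $\Delta = \mathrm{GRAD}(\theta)$
12:    until $\Delta \cdot \theta^* > -\epsilon$
13:    $s_- = s$
14:    $p_- = \Delta \cdot \theta^*$
15: else
       Step forward to bracket the maximum:
16:    repeat
17:       $s_- = s$
18:       $p_- = \Delta \cdot \theta^*$
19:       $s = 2s$
20:       $\theta = \theta_0 + s\theta^*$
21:       $\Delta = \mathrm{GRAD}(\theta)$
22:    until $\Delta \cdot \theta^* < \epsilon$
23:    $s_+ = s$
24:    $p_+ = \Delta \cdot \theta^*$
25: end if
26: if $p_- > 0$ and $p_+ < 0$ then
27:    $s = s_- - p_- \dfrac{s_+ - s_-}{p_+ - p_-}$
28: else
29:    $s = \dfrac{s_- + s_+}{2}$
30: end if
31: $\theta_0 = \theta_0 + s\theta^*$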
Algorithm 4 OLPOMDP($\beta$, $T$, $\theta_0$)

 1: Given:
      • $\beta \in [0, 1)$.
      • $T > 0$.
      • Initial parameter values $\theta_0 \in \mathbb{R}^K$.
      • Randomized parameterized policies $\{\mu(\theta, \cdot) : \theta \in \mathbb{R}^K\}$ satisfying Assumption 1.
      • POMDP with rewards satisfying Assumption 2, and which when controlled by $\mu(\theta, \cdot)$
        generates stochastic matrices $P(\theta)$ satisfying Assumption 3.
      • Step sizes $\gamma_t$, $t = 0, 1, \ldots$ satisfying $\sum_t \gamma_t = \infty$ and $\sum_t \gamma_t^2 < \infty$.
      • Arbitrary (unknown) starting state $X_0$.
 2: Set $z_0 = 0$ ($z_0 \in \mathbb{R}^K$).
 3: for $t = 0$ to $T - 1$ do
 4:    Observe $Y_t$ (generated according to $\nu(X_t)$).
 5:    Generate control $U_t$ according to $\mu(\theta_t, Y_t)$.
 6:    Observe $r(X_{t+1})$ (where the next state $X_{t+1}$ is generated according to $p_{X_t X_{t+1}}(U_t)$).
 7:    Set $z_{t+1} = \beta z_t + \dfrac{\nabla \mu_{U_t}(\theta_t, Y_t)}{\mu_{U_t}(\theta_t, Y_t)}$
 8:    Set $\theta_{t+1} = \theta_t + \gamma_t r(X_{t+1}) z_{t+1}$
 9: end for
10: return $\theta_T$
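A minimal sketch of the on-line update in Algorithm 4 is given below. As before, the callback interface and the two-state toy problem are illustrative assumptions rather than anything taken from the paper, and the step-size schedule $\gamma_t = 0.1 / (1 + t/1000)$ is just one hypothetical choice whose sum diverges while the sum of its squares stays finite.

```python
import math
import random

def olpomdp(beta, T, theta0, step_size, observe, policy, transition, reward, x0, rng):
    """Update the parameters after every step of the POMDP."""
    theta = list(theta0)
    z = [0.0] * len(theta)   # eligibility trace z_t
    x = x0
    for t in range(T):
        y = observe(x, rng)
        u, ratio = policy(theta, y, rng)   # ratio = grad mu_u / mu_u at the current theta
        x = transition(x, u, rng)          # next state X_{t+1}
        z = [beta * zk + gk for zk, gk in zip(z, ratio)]
        gamma_t = step_size(t)
        theta = [th + gamma_t * reward(x) * zk for th, zk in zip(theta, z)]
    return theta

# --- Hypothetical toy problem (not from the paper) ---
def observe(x, rng):
    return x  # the observation is the state itself

def policy(theta, y, rng):
    # Probability of control 1 is a logistic function of theta[y].
    p1 = 1.0 / (1.0 + math.exp(-theta[y]))
    u = 1 if rng.random() < p1 else 0
    ratio = [0.0] * len(theta)
    ratio[y] = (1.0 - p1) if u == 1 else -p1
    return u, ratio

def transition(x, u, rng):
    # Control u moves the chain to state u with probability 0.8.
    return u if rng.random() < 0.8 else 1 - u

def reward(x):
    return float(x)  # only state 1 is rewarded

def step_size(t):
    return 0.1 / (1.0 + t / 1000.0)

theta_final = olpomdp(0.5, 20000, [0.0, 0.0], step_size,
                      observe, policy, transition, reward, 0, random.Random(2))
```

In this toy problem choosing control 1 more often always raises the average reward, so both parameters are expected to drift upward from zero. The individual updates are noisy, which is the price paid for not averaging over a long sample path before moving.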



5. Experiments 

In this section we present several sets of experimental results. Throughout this section, where we 
refer to CONJPOMDP we mean CONJPOMDP with GPOMDP as its GRAD argument.

In the first set of experiments, we consider a system in which a controller is used to select
actions for a 3-state Markov Decision Process (MDP). For this system we are able to compute the
true gradient exactly using the matrix equation

$$\nabla \eta(\theta) = \pi'(\theta) \, \nabla P(\theta) \, [I - P(\theta) + e\pi'(\theta)]^{-1} \, r, \qquad (3)$$

where $P(\theta)$ is the transition matrix of the underlying Markov chain with the controller's parameters
set to $\theta$, $\pi'(\theta)$ is the stationary distribution corresponding to $P(\theta)$ (written as a row vector), $e\pi'(\theta)$
is the square matrix in which each row is the stationary distribution, and $r$ is the (column) vector of
rewards (see Baxter & Bartlett, 2001, Section 3, for a derivation of (3)). Hence we can compare the
estimates $\Delta_T$ generated by GPOMDP with the true gradient $\nabla \eta(\theta)$, both as a function of the number
of iterations $T$ and as a function of the discount parameter $\beta$. We also optimize the performance of
the controller using the on-line algorithm, OLPOMDP, and the off-line algorithm CONJPOMDP.
CONJPOMDP reliably converges to a near-optimal policy with around 100 iterations of the MDP,
while the on-line method requires approximately 1000 iterations. This should be contrasted with
training a linear value-function for this system using TD(1) (Sutton, 1988), which can be shown
to converge to a value function whose one-step lookahead policy is suboptimal (Weaver & Baxter,
1999).

Table 1: Transition probabilities of the three-state MDP.

r(A) = 0    φ_1(A) = [illegible]    φ_2(A) = [illegible]
r(B) = 0    φ_1(B) = [illegible]    φ_2(B) = [illegible]
r(C) = 1    φ_1(C) = [illegible]    φ_2(C) = [illegible]

Table 2: Three-state rewards and features.

In the second set of experiments, we consider a simple "puck-world" problem in which a small 
puck must be navigated around a two-dimensional world by applying thrust in the x and y directions. 
We train a 1-hidden-layer neural-network controller for the puck using CONJPOMDP. Again the
controller reliably converges to near optimality. 

In the third set of experiments we use CONJPOMDP to optimize the admission thresholds for 
the call-admission problem considered in (Marbach, 1998). 

In the final set of experiments we use CONJPOMDP to train a switched neural-network con- 
troller for a two-dimensional variant of the "mountain-car" task (Sutton & Barto, 1998, Example 
8.2). 

In all the experiments we found that convergence of the line-searches was greatly improved if
all calls to the GPOMDP algorithm were seeded with the same random number sequence. 

5.1 A three-state MDP 

In this section we consider a three-state MDP, in each state of which there is a choice of two actions
$a_1$ and $a_2$. Table 1 shows the transition probabilities as a function of the states and actions. Each
state $x$ has an associated two-dimensional feature vector $\phi(x) = (\phi_1(x), \phi_2(x))$ and reward $r(x)$
which are detailed in Table 2. Clearly, the optimal policy is to always select the action that leads to
state C with the highest probability, which from Table 1 means always selecting action $a_2$.

This rather odd choice of feature vectors for the states ensures that a value function linear in 
those features and trained using TD(1) — while observing the optimal policy — will implement a 
suboptimal greedy one-step lookahead policy (see Weaver & Baxter, 1999, for a proof). Thus, in
contrast to the gradient-based approach, for this system, TD(1) training of a linear value function is
guaranteed to produce a worse policy if it starts out observing the optimal policy.



5.1.1 Training a controller 

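Our goal is to learn a stochastic controller for this system that implements an optimal (or
near-optimal) policy. Given a parameter vector $\theta = (\theta_1, \theta_2, \theta_3, \theta_4)$, we generate a policy as follows. For
any state $x$, let

$$s_1(x) := \theta_1 \phi_1(x) + \theta_2 \phi_2(x)$$
$$s_2(x) := \theta_3 \phi_1(x) + \theta_4 \phi_2(x).$$

Then the probability of choosing action $a_1$ in state $x$ is given by

$$\mu_{a_1}(x) = \frac{e^{s_1(x)}}{e^{s_1(x)} + e^{s_2(x)}},$$

while the probability of choosing action $a_2$ is given by

$$\mu_{a_2}(x) = \frac{e^{s_2(x)}}{e^{s_1(x)} + e^{s_2(x)}} = 1 - \mu_{a_1}(x).$$

The ratios $\nabla\mu_{a_i}/\mu_{a_i}$ needed by Algorithms 1 and 4 are given by

$$\frac{\nabla\mu_{a_1}(x)}{\mu_{a_1}(x)} = \frac{e^{s_2(x)}}{e^{s_1(x)} + e^{s_2(x)}} \left[\phi_1(x), \phi_2(x), -\phi_1(x), -\phi_2(x)\right] \qquad (4)$$

$$\frac{\nabla\mu_{a_2}(x)}{\mu_{a_2}(x)} = \frac{e^{s_1(x)}}{e^{s_1(x)} + e^{s_2(x)}} \left[-\phi_1(x), -\phi_2(x), \phi_1(x), \phi_2(x)\right] \qquad (5)$$

Since the second two components in $\nabla\mu/\mu$ are always the negative of the first two, this shows that
two of the parameters are redundant in this case: we could just as well have set $\theta_3 = -\theta_1$ and
$\theta_4 = -\theta_2$.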
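The controller and the ratios in (4) and (5) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are ours, and the feature values used in any call are hypothetical rather than those of Table 2.

```python
import math


def action_probs(theta, phi):
    """Return (mu_a1, mu_a2) for a state with features phi = (phi1, phi2)."""
    s1 = theta[0] * phi[0] + theta[1] * phi[1]
    s2 = theta[2] * phi[0] + theta[3] * phi[1]
    m = max(s1, s2)  # subtract the max before exponentiating, for stability
    e1, e2 = math.exp(s1 - m), math.exp(s2 - m)
    return e1 / (e1 + e2), e2 / (e1 + e2)


def grad_ratio(theta, phi, action):
    """Return grad(mu_a)/mu_a as in (4) and (5); action is 0 for a1, 1 for a2."""
    mu1, mu2 = action_probs(theta, phi)
    if action == 0:
        # Equation (4): weight e^{s2}/(e^{s1}+e^{s2}) equals mu_a2.
        return [mu2 * phi[0], mu2 * phi[1], -mu2 * phi[0], -mu2 * phi[1]]
    # Equation (5): weight e^{s1}/(e^{s1}+e^{s2}) equals mu_a1.
    return [-mu1 * phi[0], -mu1 * phi[1], mu1 * phi[0], mu1 * phi[1]]
```

Because the ratio is the gradient of $\log \mu_a$, it can be checked against a finite-difference approximation of `math.log(action_probs(theta, phi)[a])`.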



5.1.2 Gradient estimates

With a parameter vector³ of $\theta = [1, 1, -1, -1]$, GPOMDP was used to generate estimates $\Delta_T$ of
$\nabla_\beta \eta$, for various values of $T$ and $\beta \in [0, 1)$. To measure the progress of $\Delta_T$ towards the true gradient
$\nabla\eta$, $\nabla\eta$ was calculated from (3) and then for each value of $T$ the angle between $\Delta_T$ and $\nabla\eta$ and
the relative error $\|\nabla\eta - \Delta_T\| / \|\nabla\eta\|$ were recorded. The angles and relative errors are plotted in Figures 4, 5
and 6.
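The two error measures can be computed as follows. This is a minimal sketch with helper names of our own choosing; it assumes both vectors are non-zero.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def angle_degrees(estimate, true_grad):
    """Angle between the estimate and the true gradient, in degrees."""
    dot = sum(a * b for a, b in zip(estimate, true_grad))
    cos = dot / (norm(estimate) * norm(true_grad))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding just outside [-1, 1]
    return math.degrees(math.acos(cos))


def relative_error(estimate, true_grad):
    """Norm of the difference divided by the norm of the true gradient."""
    diff = [a - b for a, b in zip(estimate, true_grad)]
    return norm(diff) / norm(true_grad)
```

The angle ignores the length of the estimate, which is why a bias that is visible in the relative error can be invisible in the angular deviation.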

The graphs illustrate a typical trade-off for the GPOMDP algorithm: small values of $\beta$ give
higher bias in the estimates, while larger values of $\beta$ give higher variance (the final bias is only
shown in Figure 6 for the norm deviation because it was too small to measure for the angular
deviation). The bias introduced by having $\beta < 1$ is very small for this system. In the worst case,
$\beta = 0.0$, the final gradient direction is indistinguishable from the true direction while the relative
deviation $\|\nabla\eta - \Delta_T\| / \|\nabla\eta\|$ is only 7.7%.

3. Other initial values of the parameter vector were chosen with similar results. Note that $[1, 1, -1, -1]$ generates a
suboptimal policy.
Figure 4: Angle between the true gradient $\nabla\eta$ and the estimate $\Delta_T$ for the three-state Markov
chain, for various values of the discount parameter $\beta$. $\Delta_T$ was generated by Algorithm 1.
Averaged over 500 independent runs. Note the higher variance at large $T$ for the larger
values of $\beta$. Error bars are one standard deviation.



5.1.3 Training via conjugate-gradient ascent 

CONJPOMDP with GPOMDP as the "GRAD" argument was used to train the parameters of the
controller described in the previous section. Following the low bias observed in the experiments of
the previous section, the argument $\beta$ of GPOMDP was set to 0. After a small amount of experimentation,
the arguments $s_0$ and $\epsilon$ of CONJPOMDP were set to 100 and 0.0001 respectively. None of
these values were critical, although the extremely large initial step-size ($s_0$) did considerably reduce
the time required for the controller to converge to near-optimality.

We tested the performance of CONJPOMDP for a range of values of the argument $T$ to
GPOMDP from 1 to 4096. Since GSEARCH only uses GPOMDP to determine the sign of the inner
product of the gradient with the search direction, it does not need to run GPOMDP for as many
iterations as CONJPOMDP does. Thus, GSEARCH determined its own $T$ parameter to GPOMDP
as follows. Initially (somewhat arbitrarily), the value of $T$ within GSEARCH was set to 1/10 the
value used in CONJPOMDP (or 1 if the value in CONJPOMDP was less than 10). GSEARCH then
called GPOMDP to obtain an estimate $\Delta_T$ of the gradient direction. If $\Delta_T \cdot \theta^* < 0$ ($\theta^*$ being the
desired search direction) then $T$ was doubled and GPOMDP was called again to generate a new
estimate $\Delta_T$. This procedure was repeated until $\Delta_T \cdot \theta^* > 0$, or $T$ had been doubled four times. If
$\Delta_T \cdot \theta^*$ was still negative at the end of this process, GSEARCH searched for a local maximum in
the direction $-\theta^*$, and the number of iterations $T$ used by CONJPOMDP was doubled on the next
iteration (the conclusion being that the direction $\theta^*$ was generated by overly noisy estimates from
GPOMDP).
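The doubling scheme just described can be sketched as below. The callable `gpomdp(theta, t)` is a hypothetical stand-in for a gradient estimator returning a list of floats, and rounding $T/10$ down to an integer is our assumption; the text does not say how fractional values were handled.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def estimate_search_sign(gpomdp, theta, direction, t_conj, max_doublings=4):
    """Return (estimate, t_used, agrees) for the search direction.

    Starts with one tenth of CONJPOMDP's T (at least 1) and doubles T until
    the estimate has a positive inner product with `direction`, or until T
    has been doubled `max_doublings` times.
    """
    t = max(1, t_conj // 10)
    for doublings in range(max_doublings + 1):
        delta = gpomdp(theta, t)
        if dot(delta, direction) > 0:
            return delta, t, True
        if doublings < max_doublings:
            t *= 2
    # Still not positive: the caller would search along -direction and
    # double the T used by CONJPOMDP on its next iteration.
    return delta, t, False
```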
Figure 5: A plot of $\|\nabla\eta - \Delta_T\| / \|\nabla\eta\|$ for the three-state Markov chain, for various values of the discount
parameter $\beta$. $\Delta_T$ was generated by Algorithm 1. Averaged over 500 independent runs.
Note the higher variance at large $T$ for the larger values of $\beta$. Error bars are one standard
deviation.




Figure 6: Graph showing the error in the estimate $\Delta_T$ (as measured by $\|\nabla\eta - \Delta_T\| / \|\nabla\eta\|$) for various values
of $\beta$ for the three-state Markov chain, plotted against the number of Markov chain iterations $T$.
$\Delta_T$ was generated by Algorithm 1. Note the
decrease in the final bias as $\beta$ increases. Both axes are log scales.
Figure 7: Performance of the 3-state Markov chain controller trained by CONJPOMDP as a function
of the total number of iterations of the Markov chain. The performance was computed
exactly from the stationary distribution induced by the controller. The average
reward of the optimal policy is 0.8. Averaged over 500 independent runs. The error bars
were computed by dividing the results into two separate bins depending on whether they
were above or below the mean, and then computing the standard deviation within each
bin.



Figure 7 shows the average reward $\eta(\theta)$ of the final controller produced by CONJPOMDP, as a
function of the total number of simulation steps of the underlying Markov chain. The plots represent 
an average over 500 independent runs of CONJPOMDP. Note that 0.8 is the average reward of the 
optimal policy. The parameters of the controller were (uniformly) randomly initialized in the range 
[-0.1, 0.1] before each call to CONJPOMDP. After each call to CONJPOMDP, the average reward 
of the resulting controller was computed exactly by calculating the stationary distribution for the 
controller. From Figure 7, optimality is reliably achieved using approximately 100 iterations of the 
Markov chain. 
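The exact evaluation mentioned above amounts to forming the transition matrix induced by the controller, finding its stationary distribution, and averaging the rewards under it. The sketch below uses power iteration, which is our choice and assumes an aperiodic, irreducible chain; any transition matrices passed in are supplied by the caller, since Table 1 is not reproduced here.

```python
def stationary_distribution(P, iters=10_000, tol=1e-13):
    """Stationary distribution of a row-stochastic matrix P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


def average_reward(P_by_action, action_probs, rewards):
    """Average reward of a stochastic policy.

    P_by_action[a][i][j] is the probability of moving from state i to j
    under action a; action_probs[i][a] is the probability of action a in
    state i; rewards[i] is the reward of state i.
    """
    n = len(rewards)
    n_actions = len(P_by_action)
    P = [[sum(action_probs[i][a] * P_by_action[a][i][j] for a in range(n_actions))
          for j in range(n)] for i in range(n)]
    pi = stationary_distribution(P)
    return sum(p * r for p, r in zip(pi, rewards))
```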

5.1.4 Training on-line with OLPOMDP

The controller was also trained on-line using Algorithm 4 (OLPOMDP) with fixed step-sizes $\gamma_t = c$
with $c = 0.1, 1, 10, 100$. Reducing step-sizes of the form $\gamma_t = c/t$ were tried, but caused intolerably
slow convergence. Figure 8 shows the performance of the controller (measured exactly as in the
previous section) as a function of the total number of iterations of the Markov chain, for different
values of the step-size $c$. The graphs are averages over 100 runs, with the controller's weights
randomly initialized in the range $[-0.1, 0.1]$ at the start of each run. From the figure, convergence
to optimal is about an order of magnitude slower than that achieved by CONJPOMDP, for the best
step-size of $c = 1.0$. Step-sizes much greater than $c = 10.0$ failed to reliably converge to an optimal
policy.
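For reference, a single on-line update with a fixed step-size might look like the sketch below. It assumes the usual form of the update (a discounted eligibility trace of the ratios $\nabla\mu/\mu$, followed by a step along the reward-weighted trace); Algorithm 4 itself is not reproduced in this section, so treat the exact form as our reading rather than a definitive statement.

```python
def olpomdp_step(theta, z, grad_ratio, reward, beta, step_size):
    """One on-line update; returns the new (theta, z).

    grad_ratio is grad(mu)/mu for the action just taken, and reward is the
    reward received after taking it.
    """
    # Discount the old trace and add the new score term.
    z_new = [beta * zk + gk for zk, gk in zip(z, grad_ratio)]
    # Move the parameters along the reward-weighted trace.
    theta_new = [tk + step_size * reward * zk for tk, zk in zip(theta, z_new)]
    return theta_new, z_new
```

With a fixed `step_size` the parameters never settle completely, which is consistent with the sensitivity to $c$ reported above.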
Figure 8: Performance of the 3-state Markov chain controller as a function of the number of iteration
steps in the on-line algorithm, Algorithm 4, for fixed step sizes of 0.1, 1, 10, and 100.
Error bars were computed as in Figure 7.



5.2 Puck World 

In this section, experiments are described in which CONJPOMDP and OLPOMDP were used to 
train one-hidden-layer neural-network controllers to navigate a small puck around a two-dimensional
world. 

5.2.1 The world

The puck was a unit-radius, unit-mass disk constrained to move in the plane in a region 100 units
square. The puck had no internal dynamics (i.e., rotation). Collisions with the region's boundaries
were inelastic with a (tunable) coefficient of restitution e (set to 0.9 for the experiments reported
here). The puck was controlled by applying a 5 unit force in either the positive or negative x
direction, and a 5 unit force in either the positive or negative y direction, giving four different
controls in total. The control could be changed every 1/10 of a second, and the simulator operated
at a granularity of 1/100 of a second. The puck also had a retarding force due to air resistance of
$0.005 \times \text{speed}^2$. There was no friction between the puck and the ground.

The puck was given a reward at each decision point (1/10 of a second) equal to $-d$ where $d$
was the distance between the puck and some designated target point. To encourage the controller
to learn to navigate the puck to the target independently of the starting state, the puck state was
reset every 30 (simulated) seconds to a random location and random x and y velocities in the range
$[-10, 10]$, and at the same time the target position was set to a random location.
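A rough sketch of one decision step of such a simulator is given below. Several details are our assumptions and not taken from the text: explicit Euler integration, keeping the puck centre at least one radius from each wall, and placing the puck back on the wall when it bounces.

```python
import math

DT = 0.01           # simulator granularity in seconds
SUBSTEPS = 10       # the control is held for 1/10 of a second
SIZE = 100.0        # side length of the square region
RADIUS = 1.0        # puck radius
DRAG = 0.005        # air resistance coefficient
RESTITUTION = 0.9   # coefficient of restitution


def bounce(pos, vel):
    """Reflect off a wall (assumed handling: clamp position, damp velocity)."""
    lo, hi = RADIUS, SIZE - RADIUS
    if pos < lo:
        return lo, RESTITUTION * abs(vel)
    if pos > hi:
        return hi, -RESTITUTION * abs(vel)
    return pos, vel


def step(state, thrust):
    """Advance (x, y, vx, vy) by one decision interval under thrust (fx, fy)."""
    x, y, vx, vy = state
    for _ in range(SUBSTEPS):
        speed = math.hypot(vx, vy)
        # Unit mass, so acceleration equals force; the drag has magnitude
        # DRAG * speed^2 and opposes the velocity.
        ax = thrust[0] - DRAG * speed * vx
        ay = thrust[1] - DRAG * speed * vy
        vx, vy = vx + ax * DT, vy + ay * DT
        x, y = x + vx * DT, y + vy * DT
        x, vx = bounce(x, vx)
        y, vy = bounce(y, vy)
    return (x, y, vx, vy)


def reward(state, target):
    """Negative distance between the puck and the target point."""
    return -math.hypot(state[0] - target[0], state[1] - target[1])
```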

Note that the size of the state-space in this example is essentially infinite, being of the order of
$2^{\text{PRECISION}}$ where PRECISION is the floating point precision of the machine (64 bits). Thus, the
time between visits to a recurrent state is likely to be large. Also, the puck cannot just maximize its
immediate reward because this leads to significant overshooting of the target locations.

5.2.2 The controller 

A one-hidden-layer neural-network with six input nodes, eight hidden nodes and four output nodes 
was used to generate a probabilistic policy in a similar manner to the controller in the three-state 
Markov chain example of the previous section. Four of the inputs were set to the raw x and y 
locations and velocities of the puck at the current time-step, the other two were the differences 
between the puck's x and y location and the target's x and y location respectively. The location 
inputs were scaled to lie between -1 and 1, while the velocity inputs were scaled so that a speed
of 10 units per second mapped to a value of 1. The hidden nodes computed a tanh squashing 
function, while the output nodes were linear. Each hidden and output node had the usual additional 
offset parameter. The four output nodes were exponentiated and then normalized as in the Markov- 
chain example to produce a probability distribution over the four controls (±5 units thrust in the x 
direction, ±5 units thrust in the y direction). Controls were selected at random from this distribution. 
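The forward pass of such a controller can be sketched as follows. The parameter layout, the initialization range and the function names are illustrative assumptions; only the architecture (six inputs, eight tanh hidden nodes, four linear outputs, exponentiate and normalize) is taken from the description above.

```python
import math
import random


def init_params(n_in=6, n_hidden=8, n_out=4, scale=0.1, rng=random):
    """Random weights and offsets in [-scale, scale] (hypothetical layout)."""
    def mat(rows, cols):
        return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]
    return {
        "W1": mat(n_hidden, n_in), "b1": [rng.uniform(-scale, scale) for _ in range(n_hidden)],
        "W2": mat(n_out, n_hidden), "b2": [rng.uniform(-scale, scale) for _ in range(n_out)],
    }


def control_probs(params, inputs):
    """Probability distribution over the four controls for one input vector."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
              for row, b in zip(params["W1"], params["b1"])]
    out = [sum(w * h for w, h in zip(row, hidden)) + b
           for row, b in zip(params["W2"], params["b2"])]
    m = max(out)  # subtracting the max leaves the normalized result unchanged
    exps = [math.exp(o - m) for o in out]
    total = sum(exps)
    return [e / total for e in exps]


def sample_control(probs, rng=random):
    """Pick a control index at random according to probs."""
    return rng.choices(range(len(probs)), weights=probs)[0]
```

With these sizes the controller has 6·8 + 8 + 8·4 + 4 = 92 adjustable parameters.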

5.2.3 Conjugate gradient ascent 

We trained the neural-network controller using CONJPOMDP with the gradient estimates generated
by GPOMDP. After some experimentation we chose $\beta = 0.95$ and $T$ = 1,000,000 as the parameters
CONJPOMDP supplied to GPOMDP. GSEARCH used the same value of $\beta$ and the scheme
discussed in Section 5.1.3 to determine the number of iterations with which to call GPOMDP.

Due to the saturating nature of the neural-network hidden nodes (and the exponentiated output
nodes), there was a tendency for the network weights to converge to local minima at "infinity".
That is, the weights would grow very rapidly early on in the simulation, but towards a suboptimal
solution. Large weights tend to imply very small gradients and thus the network becomes "stuck"
at these suboptimal solutions. We have observed a similar behaviour when training neural networks
for pattern classification problems. To fix the problem, we subtracted a small quadratic penalty term
$\gamma\|\theta\|^2$ from the performance estimates and hence also a small correction $2\gamma\theta_i$ from the gradient
calculation⁴ for $\theta_i$.

We used a decreasing schedule for the quadratic penalty weight $\gamma$ (arrived at through some
experimentation). $\gamma$ was initialized to 0.5 and then on every tenth iteration of CONJPOMDP, if the
performance had improved by less than 10% from the value ten iterations ago, $\gamma$ was reduced by a
factor of 10. This schedule solved nearly all the local minima problems, but at the expense of slower
convergence of the controller.
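The penalty and its schedule can be sketched as below. Measuring the 10% improvement relative to the magnitude of the earlier performance value is our interpretation (the rewards here are negative), and the function names are hypothetical.

```python
def penalized(performance, gradient, theta, gamma):
    """Subtract gamma * ||theta||^2 from the performance and 2 * gamma * theta_i
    from each gradient component."""
    perf = performance - gamma * sum(t * t for t in theta)
    grad = [g - 2.0 * gamma * t for g, t in zip(gradient, theta)]
    return perf, grad


def update_penalty_weight(gamma, iteration, history, min_gain=0.10):
    """Return the penalty weight to use after `iteration`.

    history[k] is the performance recorded after iteration k. On every tenth
    iteration, gamma is divided by 10 if the performance improved by less
    than min_gain (as a fraction of the old value's magnitude) over the
    last ten iterations.
    """
    if iteration >= 10 and iteration % 10 == 0:
        old, new = history[iteration - 10], history[iteration]
        if new - old < min_gain * abs(old):
            return gamma / 10.0
    return gamma
```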

A plot of the average reward of the neural-network controller is shown in Figure 9, as a function
of the number of iterations of the POMDP. The graph is an average over 100 independent runs,
with the parameters initialized randomly in the range $[-0.1, 0.1]$ at the start of each run. The four
bad runs shown in Figure 10 were omitted from the average because they gave misleadingly large 
error bars. 

Note that the optimal performance (within the neural-network controller class) seems to be 
around -8 for this problem, due to the fact that the puck and target locations are reset every 30
simulated seconds and hence there is a fixed fraction of the time that the puck must be away from 

4. When used as a technique for capacity control in pattern classification, this technique goes by the name "weight 
decay". Here we used it to condition the optimization problem. 
Figure 9: Performance of the neural-network puck controller as a function of the number of iterations
of the puck world, when trained using CONJPOMDP. Performance estimates were
generated by simulating for 1,000,000 iterations. Averaged over 100 independent runs
(excluding the four bad runs in Figure 10).



the target. From Figure 9 we see that the final performance of the puck controller is close to optimal. 
In only 4 of the 100 runs did CONJPOMDP get stuck in a suboptimal local minimum. Three of 
those cases were caused by overshooting in GSEARCH (see Figure 10), which could be prevented 
by adding extra checks to CONJPOMDP. 

Figure 11 illustrates the behaviour of a typical trained controller. For the purpose of the illustration,
only the target location and puck velocity were randomized every 30 seconds, not the puck
location. 

5.3 Call Admission Control 

In this section we report the results of experiments in which CONJPOMDP was applied to the task 
of training a controller for the call admission problem treated by Marbach (1998, Chapter 7). 

5.3.1 The Problem 

The call admission control problem treated by Marbach (1998, Chapter 7) models the situation
in which a telecommunications provider wishes to sell bandwidth on a communications link to 
customers in such a way as to maximize long-term average reward. 

Specifically, the problem is a queuing problem. There are three different types of call, each 
with its own call arrival rate $\alpha(1)$, $\alpha(2)$, $\alpha(3)$, bandwidth demand $b(1)$, $b(2)$, $b(3)$ and average
holding time $h(1)$, $h(2)$, $h(3)$. The arrivals are Poisson distributed while the holding times are
exponentially distributed. The link has a maximum bandwidth of 10 units. When a call arrives and 
there is sufficient available bandwidth, the service provider can choose to accept or reject the call 
(if there is not enough available bandwidth the call is always rejected). Upon accepting a call of 
Figure 10: Plots of the performance of the neural-network puck controller for the four runs (out of
100) that converged to substantially suboptimal local minima.

Figure 11: Illustration of the behaviour of a typical trained puck controller.



type $m$, the service provider receives a reward of $r(m)$ units. The goal of the service provider is to
maximize the long-term average reward. 

The parameters associated with each call type are listed in Table 3. With these settings, the 
optimal policy (found by dynamic programming by Marbach (1998)) is to always accept calls of 
type 2 and 3 (assuming sufficient available bandwidth) and to accept calls of type 1 if the available 
Call Type                          1      2      3
Bandwidth Demand        $b$        1      1      1
Arrival Rate            $\alpha$   1.8    1.6    1.4
Average Holding Time    $h$        0.6    0.5    0.4
Reward                  $r$        1      2      4

Table 3: Parameters of the call admission control problem.



bandwidth is at least 3. This policy has an average reward of 0.804, while the "always accept"
policy has an average reward⁵ of 0.784.

5.3.2 The Controller 

The controller had three parameters $\theta = (\theta_1, \theta_2, \theta_3)$, one for each type of call. Upon arrival of a call
of type $m$, the controller chooses to accept the call with probability



where $b$ is the currently used bandwidth. This is the class of controllers studied by Marbach (1998).

5.3.3 Conjugate gradient ascent

CONJPOMDP was used to train the above controller, with GPOMDP generating the gradient estimates
from a range of values of $\beta$ and $T$. The influence of $\beta$ on the performance of the trained
controllers was marginal, so we set $\beta = 0.0$ which gave the lowest-variance estimates. We used
the same value of $T$ for calls to GPOMDP within CONJPOMDP and within GSEARCH, and this
was varied between 10 and 10,000. The controller was always started from the same parameter
setting $\theta = (8, 8, 8)$ (as was done by Marbach (1998)). The value of this initial policy is 0.691. The
graph of the average reward of the final controller produced by CONJPOMDP as a function of the
total number of iterations of the queue is shown in Figure 12. A performance of 0.784 was reliably
achieved with fewer than 2000 iterations of the queue.

Note that the optimal policy is not achievable with this controller class since it is incapable
of implementing any threshold policy other than the "always accept" and "always reject" policies.
Although not provably optimal, a parameter setting of $\theta_1 \approx 7.5$ and any suitably large values of $\theta_2$
and $\theta_3$ generates something close to the optimal policy within the controller class, with an average
reward of 0.8. Figure 13 shows the probability of accepting a call of each type under this policy
(with $\theta_2 = \theta_3 = 15$), as a function of the available bandwidth.

The controllers produced by CONJPOMDP with $\beta = 0.0$ and sufficiently large $T$ are essentially
"always accept" controllers with an average reward of 0.784, within 2% of the optimum achievable
in the class. To produce policies even nearer to the optimal policy in performance, CONJPOMDP
must keep $\theta_1$ close to its starting value of 8, and hence the gradient estimate $\Delta_T = (\Delta_1, \Delta_2, \Delta_3)$

5. There is some discrepancy between our average rewards and those quoted by Marbach (1998). This is probably due
to a discrepancy in the way the state transitions are counted, which was not clear from the discussion in Marbach
(1998).
Figure 12: Performance of the call admission controller trained by CONJPOMDP as a function of
the total number of iterations of the queue. The performance was computed by simulating
the controller for 100,000 iterations. The average reward of the globally optimal
policy is 0.804, the average reward of the optimal policy within the class is 0.8, and
the plateau performance of CONJPOMDP is 0.784. The graphs are averages from 100
independent runs.

Figure 13: Probability of accepting a call of each type under the call admission policy with near-optimal
parameters $\theta_1 = 7.5$, $\theta_2 = \theta_3 = 15$. Note that calls of type 2 and 3 are
essentially always accepted.



Figure 14: Plot of the three components of $\Delta_T$ for the call admission problem, as a function of the
discount parameter $\beta$ (legend: Delta1, Delta2, Delta3; horizontal axis: Beta). The parameters were set at $\theta = (8, 8, 8)$. $T$ was set to 1,000,000.
Note that $\Delta_1$ does not become negative (the correct sign) until $\beta \approx 0.93$.



produced by GPOMDP must have a relatively small first component. Figure 14 shows a plot of
normalized $\Delta_T$ as a function of $\beta$, for $T$ = 1,000,000 (sufficiently large to ensure low variance
in $\Delta_T$) and the starting parameter setting $\theta = (8, 8, 8)$. From the figure, $\Delta_1$ starts at a high value
which explains why CONJPOMDP produces "always accept" controllers for $\beta = 0.0$, and does not
become negative until $\beta \approx 0.93$, a value for which the variance in $\Delta_T$ even for moderately large $T$
is relatively high.

A plot of the performance of CONJPOMDP for $\beta = 0.9$ and $\beta = 0.95$ is shown in Figure
15. Approximately half of the remaining 2% in performance can be obtained by setting $\beta = 0.9$,
while for $\beta = 0.95$ a sufficiently large choice for $T$ gives most of the remaining performance. For
this problem, there is a huge difference between gaining 98% of optimal performance, which is
achieved for $\beta = 0.0$ and fewer than 2000 iterations of the queue, and gaining 99% of the optimal
which requires $\beta = 0.9$ and of the order of 500,000 queue iterations. A similar convergence rate
and final approximation error to the latter case were reported for the on-line algorithms by Marbach
(1998, Chapter 7).

5.4 Mountainous Puck World 

The "mountain-car" task is a well-studied problem in the reinforcement learning literature (Sutton 
& Barto, 1998, Example 8.2). As shown in Figure 16, the task is to drive a car to the top of a one- 
dimensional hill. The car is not powerful enough to accelerate directly up the hill against gravity, so 
any successful controller must learn to "oscillate" back and forth until it builds up enough speed to 
crest the hill. 

In this section we describe a variant of the mountain car problem based on the puck-world 
example of Section 5.2. With reference to Figure 17, in our problem the task is to navigate a puck 
Figure 15: Performance of the call admission controller trained by CONJPOMDP as a function
of the total number of iterations of the queue. The performance was calculated by
simulating the controller for 1,000,000 iterations. The graphs are averages from 100
independent runs.




out of a valley and onto a plateau at the northern end of the valley. As in the mountain-car task, the 
puck does not have sufficient power to accelerate directly up the hill, and so has to learn to oscillate 
in order to climb out of the valley. Once again we were able to reliably train near-optimal neural-network
controllers for this problem, using CONJPOMDP and GSEARCH, and with GPOMDP
generating the gradient estimates.

5.4.1 The world

The world dimensions, physics, puck dynamics and controls were identical to the flat puck world 
described in Section 5.2, except that the puck was subject to a constant gravitational force of 10 
units, the maximum allowed thrust was 3 units (instead of 5), and the height of the world varied as 
Figure 17: In our variant of the mountain-car problem the task is to navigate a puck out of a valley
and onto the northern plateau. The puck starts at the bottom of the valley and does not 
have enough power to drive directly up the hill. 



With only 3 units of thrust, a unit mass puck cannot accelerate directly out of the valley.

Every 120 (simulated) seconds, the puck was initialized with zero velocity at the bottom of
the valley, with a random x location. The puck was given no reward while in the valley or on the
southern plateau, and a reward of $100 - s^2$ while on the northern plateau, where $s$ was the speed
of the puck. We found the speed penalty helped to improve the rate of convergence of the neural
network controller.

5.4.2 The controller 

After some experimentation we found that a neural-network controller could be reliably trained to 
navigate to the northern plateau, or to stay on the northern plateau once there, but it was difficult to 
combine both in the same controller (this is not so surprising since the two tasks are quite distinct). 
To overcome this problem, we trained a "switched" neural-network controller: the puck used one 
controller when in the valley and on the southern plateau, and then switched to a second neural- 
network controller while on the northern plateau. Both controllers were one-hidden-layer neural- 
networks with nine input nodes, five hidden nodes and four output nodes. The nine inputs were the 
normalized ($[-1, 1]$-valued) x, y and z puck locations, the normalized x, y and z locations relative
to the center of the northern wall, and the x, y and z puck velocities. The four outputs were used to
generate a policy in the same fashion as the controller of Section 5.2.2. 

An approach requiring less prior knowledge would be to have a third controller that stochasti- 
cally selects the base neural network controller as a function of the puck's location. This "master" 
Figure 18: Performance of the neural-network puck controller as a function of the number of iterations
of the mountainous puck world, when trained using CONJPOMDP. Performance
estimates were generated by simulating for 1,000,000 iterations. Averaged over 100
independent runs.



controller could itself be parameterized and have its parameters trained along with the base con- 
trollers. 

5.4.3 Conjugate gradient ascent 

The switched neural-network controller was trained using the same scheme discussed in Section 5.2.3,
except this time the discount factor $\beta$ was set to 0.98.

A plot of the average reward of the neural-network controller is shown in Figure 18, as a function 
of the number of iterations of the POMDP. The graph is an average over 100 independent runs, with 
the neural-network controller parameters initialized randomly in the range $[-0.1, 0.1]$ at the start of
each run. In this case no run failed to converge to near-optimal performance. From the figure we 
can see that the puck's performance is nearly optimal after about 40 million total iterations of the 
puck world. Although this figure may seem rather high, to put it in some perspective note that a 
random neural-network controller takes about 10,000 iterations to reach the northern plateau from a 
standing start at the base of the valley. Thus, 40 million iterations is equivalent to only about 4,000 
trips to the top for a random controller. 

Note that the puck converges to a final average performance around 75, which indicates it is 
spending at least 75% of its time on the northern plateau. Observation of the puck's final behaviour 
shows it behaves nearly optimally in terms of oscillating back and forth to get out of the valley. 

5.5 Choosing $\beta$ and the Running Time of GPOMDP

One aspect of these experiments that required some measure of tuning is the choice of the $\beta$ parameter
and running time $T$ used by GPOMDP. Although these were selected by trial and error, we have
had some success recently with a scheme for automatically choosing these parameters as follows.
Before any training begins, GPOMDP is run for a large number of iterations whilst simultaneously
generating gradient estimates for a number of different choices of $\beta$. This can be done from a single
simulation simply by maintaining a separate eligibility trace $z_t$ for each value of $\beta$. Since the bias
reduces with increasing $\beta$, the largest $\beta$ that gives a reasonably low-variance gradient estimate at the
end of the long run is selected as a "reference" $\beta$ (the variance is estimated by comparing gradient
estimates at reasonably well-separated intervals towards the end of the run). Furthermore, since
the variance of the gradient estimate decreases as $\beta$ decreases, all gradient estimates for values of $\beta$
smaller than the reference $\beta$ will typically have smaller variance than that of the reference $\beta$. Hence,
we can reliably compare the directions for smaller values of $\beta$ with the direction given by the reference $\beta$,
and choose the smallest $\beta$ whose corresponding direction is sufficiently close to the reference $\beta$
direction. We take "sufficiently close" to mean within 10°-15°.
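The two ingredients of this scheme, one eligibility trace per $\beta$ fed from a single simulation and a direction comparison against the reference $\beta$, can be sketched as follows. The running-average form of the estimate is our reading of GPOMDP (Algorithm 1 is not reproduced in this section), and the function names and the 15° default are illustrative.

```python
import math


def gpomdp_multi_beta_step(betas, traces, estimates, grad_ratio, reward, t):
    """Update the trace and running-average estimate for every beta in place.

    traces[b] and estimates[b] are lists of floats; grad_ratio is
    grad(mu)/mu for the action just taken; t is the 0-based step index.
    """
    for b in betas:
        z = [b * zk + gk for zk, gk in zip(traces[b], grad_ratio)]
        traces[b] = z
        estimates[b] = [d + (reward * zk - d) / (t + 1)
                        for d, zk in zip(estimates[b], z)]


def angle_degrees(u, v):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (nu * nv)))))


def choose_beta(estimates, reference_beta, max_angle=15.0):
    """Smallest beta whose direction is within max_angle of the reference."""
    ref = estimates[reference_beta]
    for b in sorted(estimates):
        if b <= reference_beta and angle_degrees(estimates[b], ref) <= max_angle:
            return b
    return reference_beta
```

Choosing the reference $\beta$ itself (the largest value with acceptably low variance at the end of the run) is left to the caller in this sketch.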

Note that this scheme only works if the original run is sufficiently long to get a low-variance
direction estimate at the right value of $\beta$. If the right value of $\beta$ is too large then any fixed bound on
the run length can be made to fail, but this will be a problem for all algorithms that automatically
choose $\beta$.

Once a suitable $\beta$ has been found, we can go back and find the point in the original long run
where the direction estimate corresponding to that value of $\beta$ "settled down" (again, we measure
the variance of the estimates by sampling at suitably large intervals, and choose a point where the
variance falls below some chosen value). This time is then used as the running time $T$ for GPOMDP
when estimating the gradient direction. Finally, the running time used in GPOMDP when bracketing
the maximum in GSEARCH can also be automatically tuned by starting with an initial fixed running
time that is a fraction of $T$, and then continuing until the sign of the inner product of the estimates
produced by GPOMDP with the search direction "settles down". With this technique, the sign
estimation time is usually considerably smaller than the gradient direction estimation time.

Another useful heuristic is to re-estimate $\beta$ and GPOMDP's running time $T$ whenever the parameters
change by a large amount, since a large change in $\theta$ can lead to significant changes in the
mixing time of the POMDP.

6. Conclusion 

This paper showed how to use the performance gradient estimates generated by the GPOMDP algorithm
(Baxter & Bartlett, 2001) to optimize the average reward of parameterized POMDPs. We
described both a traditional "on-line" stochastic gradient algorithm and an "off-line" approach that 
relied on the use of GSEARCH, a robust line-search algorithm that uses gradient estimates, rather 
than value estimates, to bracket the maximum. The off-line approach in particular was found to per- 
form well on four quite distinct problems: optimizing a controller for a three-state MDP, optimizing 
a neural-network controller for navigating a puck around a two-dimensional world, optimizing a 
controller for a call admission problem, and optimizing a switched neural-network controller in a 
variation of the classical mountain-car task. One reason for the superiority of the off-line approach 
is that by searching for a local maximum at each step it makes much more aggressive use of the 
gradient information than does the on-line algorithm. 

For the three-state MDP and the call-admission problems we were able to provide graphic illustrations
of how the bias and variance of the gradient estimates $\nabla_\beta \eta$ can be traded against one another
by varying $\beta$ between 0 (low variance, high bias) and 1 (high variance, low bias).

Relatively little tuning was required to generate these results. In addition, the controllers operated
on direct and simple representations of the state, in contrast to the more complex representations
usually required of value-function based approaches.

It is often the case that value-function methods converge much more rapidly than their policy- 
gradient counterparts. This is due to the fact that they enforce constraints on the value-function. 
With this in mind an interesting avenue for further research is Actor-Critic algorithms (Barto et al., 
1983; Baird & Moore, 1999; Kimura & Kobayashi, 1998; Konda & Tsitsiklis, 2000; Sutton, 
McAllester, Singh, & Mansour, 2000) in which one attempts to combine the fast convergence of
value-functions with the theoretical guarantees of policy-gradient approaches.

Despite the success of the off-line approach in the experiments described here, the on-line algo- 
rithm has advantages in other settings. In particular, when it is applied to multi-agent reinforcement 
learning, both gradient computations and parameter updates can be performed for distinct agents 
without any communication beyond the global distribution of the reward signal. This idea has led to
a parameter optimization procedure for spiking neural networks, and some successful preliminary 
results with network routing (Bartlett & Baxter, 1999; Tao, Baxter, & Weaver, 2001). 